Analytical method for detecting outlier evaluators

Background Epidemiologic and medical studies often rely on evaluators to obtain measurements of exposures or outcomes for study participants, and valid estimates of associations depend on the quality of the data. Even though statistical methods have been proposed to adjust for measurement errors, they often rely on unverifiable assumptions and could lead to biased estimates if those assumptions are violated. Therefore, methods for detecting potential ‘outlier’ evaluators are needed to improve data quality during the data collection stage. Methods In this paper, we propose a two-stage algorithm to detect ‘outlier’ evaluators whose evaluation results tend to be higher or lower than those of their counterparts. In the first stage, evaluators’ effects are obtained by fitting a regression model. In the second stage, hypothesis tests are performed to detect ‘outlier’ evaluators, where we consider both the power of each hypothesis test and the false discovery rate (FDR) among all tests. We conduct an extensive simulation study to evaluate the proposed method, and illustrate the method by detecting potential ‘outlier’ audiologists in the data collection stage for the Audiology Assessment Arm of the Conservation of Hearing Study, an epidemiologic study examining risk factors for hearing loss in the Nurses’ Health Study II. Results Our simulation study shows that our method not only can detect true ‘outlier’ evaluators, but also is less likely to falsely reject true ‘normal’ evaluators. Conclusions Our two-stage ‘outlier’ detection algorithm is a flexible approach that can effectively detect ‘outlier’ evaluators, and thus data quality can be improved during the data collection stage. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-023-01988-4.


Introduction
Many medical and epidemiological studies that investigate relationships between risk factors and disease outcomes rely on multiple evaluators (e.g. clinicians, technicians) to measure the exposures or outcomes of interest among study participants. For example, in large epidemiologic studies of hearing loss, pure-tone audiometry measurements are typically obtained by multiple audiologists or trained technicians in sound-treated booths [1][2][3]. Similarly, in large studies of vision, vision tests are often conducted by multiple evaluators in a clinic setting [4,5]. Further, potential issues related to the collection of data by multiple evaluators may also extend to studies that rely on data collected by non-human testing methods, such as automated audiometers [6], to obtain test measurements. Obtaining precise estimates of the association between risk factors and disease outcomes depends not only on the statistical methods used, but also on the quality of the data itself. Although many analytical methods have been proposed to adjust for measurement errors arising from poor-quality data, those methods typically rely on unverifiable assumptions [7] and come at a cost to the precision of the estimates. Therefore, collecting data of better quality is preferable to using statistical methods to adjust for the biases induced by lower-quality data at the statistical analysis stage. In this paper, we propose methods for quality control during the data collection stage so that problems with the measurements of exposures or outcomes can be discovered and addressed promptly. Our work is motivated by the Conservation of Hearing Study (CHEARS), an investigation of risk factors for hearing loss among participants in the Nurses' Health Study II (NHS II), an ongoing cohort study consisting of 116,430 registered female nurses in the US, aged 25-42 years at enrollment in 1989 [8].
The CHEARS Audiology Assessment Arm (AAA) assessed the longitudinal change in the pure-tone air and bone conduction audiometric hearing thresholds (the sound intensity of a pure tone at which it is first perceived), measured in decibels in hearing level (dB HL), across the full range of conventional frequencies (0.5-8 kHz) [9]. Baseline testing was conducted on 3,749 women who self-reported their hearing status as 'excellent', 'very good' or 'a little hearing trouble', and who resided in proximity to one of 19 CHEARS testing sites across the US [9]. The 3-year follow-up testing was completed on 3,136 participants (84%). In order to obtain reliable hearing measurements, it is critical to detect potential 'outlier' audiologists who tend to have higher or lower hearing test measurements than other audiologists. Once an 'outlier' audiologist is identified, the devices used by this audiologist can be examined and an early intervention can be carried out during the data collection stage if necessary. Moreover, this outlier information may have important implications for the approach to data analysis.
To the best of our knowledge, there are no existing statistical methods for detecting 'outlier' evaluators. In this paper, we develop an innovative two-stage algorithm for detecting 'outlier' evaluators. In the first stage, rather than directly evaluating the observed measurements, we extract evaluators' effects on the measurements through regression analysis, in which the influences of other variables can be accounted for. In the second stage, we perform hypothesis tests to detect 'outlier' evaluators based on the estimated coefficients and variances from the first-stage regression analysis.
The paper is organized as follows. In Section 'Methods', we present the two-stage algorithm to detect 'outlier' evaluators for scenarios in which each study participant has either a single measurement or multiple measurements. In Section 'Simulation', we perform a simulation study to investigate the performance of our two-stage algorithm. Section 'Application' presents a real data analysis to detect 'outlier' audiologists in the CHEARS AAA. Section 'Discussion' concludes the paper.

First stage regression
We first consider the scenario in which each study participant has only one measurement, obtained by an evaluator. Throughout the paper, we assume that the exposure or test outcome of each study participant is measured by only one evaluator, but one evaluator can measure multiple study participants. Let $i \in \{1, 2, \ldots, N\}$ index the study participants and $j \in \{1, 2, \ldots, M\}$ index the evaluators who measure the exposure or test outcome. Let $n_j$ denote the number of study participants who are evaluated by the j-th evaluator, such that $\sum_{j=1}^{M} n_j = N$. To estimate the effects of evaluators on the measurements, in the first stage, we fit the following linear regression:
$$Y_i = \sum_{j=1}^{M} \beta_j T_i^{(j)} + \gamma^T X_i + \epsilon_i,$$
where $Y_i$ is the measurement for the i-th study participant, $T_i^{(j)}$ is an evaluator indicator which is 1 if the i-th study participant's exposure or outcome is evaluated by the j-th evaluator and 0 otherwise, $X_i$ is a p-dimensional vector containing potential confounders for the evaluator-$Y_i$ relationship and predictors of $Y_i$, $\gamma^T$ is the transpose of the p-dimensional coefficient vector $\gamma$, and $\epsilon_i$ is the residual. We use the superscript $T$ to denote the transpose of a vector or matrix throughout the paper. Unless otherwise specified, all vectors are column vectors. Note that the first stage regression can go beyond linearity: nonlinear forms of $X_i$ can be included to account more accurately for the effects of the covariates on the measurement. The regression coefficient $\beta_j$ represents the mean effect of evaluator j on the measurement after adjusting for $X$, and in the absence of 'outlier' evaluators, $\beta_j$, $j = 1, \ldots, M$, should be similar across different evaluators.
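As an illustration of the first stage in the single-measurement case, the following is a minimal sketch using ordinary least squares with NumPy. The function name `first_stage_ols`, the toy data, and the use of the classical homoscedastic variance estimator are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def first_stage_ols(y, evaluator, X):
    """Fit Y_i = sum_j beta_j T_i^(j) + gamma^T X_i + eps_i (no intercept).

    Hypothetical helper: `evaluator` holds integer labels 0..M-1.
    Returns the evaluator coefficients and their estimated covariance.
    """
    y = np.asarray(y, dtype=float)
    evaluator = np.asarray(evaluator)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    M = int(evaluator.max()) + 1
    # Evaluator indicator columns T^(1), ..., T^(M)
    T = np.zeros((len(y), M))
    T[np.arange(len(y)), evaluator] = 1.0
    D = np.hstack([T, X])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    # Classical OLS variance estimate (an assumption of this sketch)
    sigma2 = resid @ resid / (len(y) - D.shape[1])
    cov = sigma2 * np.linalg.inv(D.T @ D)
    return coef[:M], cov[:M, :M]

# Toy data: 5 evaluators with 40 participants each; evaluator 4 reads high.
rng = np.random.default_rng(0)
M, n = 5, 40
evaluator = np.repeat(np.arange(M), n)
x = rng.normal(size=M * n)
true_beta = np.array([67.0, 67.0, 67.0, 67.0, 75.0])
y = true_beta[evaluator] + 2.0 * x + rng.normal(scale=1.0, size=M * n)

beta_hat, cov_beta = first_stage_ols(y, evaluator, x)
```

Because the evaluator indicators sum to one for every participant, the sketch omits a separate intercept, in line with the model above.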
In practice, there may be multiple measurements for all or some of the study participants. Let $k \in \{1, 2, \ldots, t_i\}$ index the measurements for the i-th study participant. For example, in the CHEARS AAA, study participants have both ears tested by audiologists, and therefore we have $t_i = 2$ for each participant at each frequency.
In the CHEARS AAA, the Pearson correlation coefficients between the hearing test outcomes of the left and right ear are over 0.7 regardless of frequencies.
To take into account the correlation between multiple measurements while still estimating the mean effect of evaluators on the measurements after controlling for potential confounders, we propose to apply the Generalized Estimating Equations (GEE) method in the first-stage regression analysis to estimate the effects of evaluators [10,11]. The model for the multiple correlated measurements extends the first-stage linear regression to the measurement level: the mean of the k-th measurement of the i-th study participant additionally depends on $Z_{i,k}$, which contains information that is specific to the k-th measurement of the i-th study participant, and the $t_i \times t_i$ variance-covariance matrix of the measurements of the i-th study participant is unknown.
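One plausible way to write the marginal model for the correlated measurements is sketched below. The symbols $Y_{i,k}$, $\eta$ and $\Sigma_i$ are notation assumed for this sketch and may differ from the paper's own; $\beta_j$, $T_i^{(j)}$, $\gamma$, $X_i$ and $Z_{i,k}$ are as defined in the text.

```latex
\begin{align*}
E(Y_{i,k}) &= \sum_{j=1}^{M} \beta_j T_i^{(j)} + \gamma^T X_i + \eta^T Z_{i,k},
  \qquad k = 1, \ldots, t_i, \\
\mathrm{Var}\big( (Y_{i,1}, \ldots, Y_{i,t_i})^T \big) &= \Sigma_i .
\end{align*}
```

In a GEE fit, $\Sigma_i$ is replaced by a working covariance structure (for example, exchangeable), and robust variance estimates are used for the coefficients.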
The coefficients $\beta_1, \ldots, \beta_M$ reflect evaluators' effects on the measurements. An 'outlier' evaluator will have a different coefficient than the remaining 'normal' ones. Thus, in the second stage, we perform hypothesis tests to detect 'outlier' evaluators based on the estimates $\hat{\beta}$ and $\widehat{\mathrm{Var}}(\hat{\beta})$.

Hypothesis testing
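In the second stage, we detect 'outlier' evaluators who give different measurements than their counterparts after adjusting for true predictors and confounders of the outcome. We now formally define 'outlier' evaluators as those evaluators whose effects on the measurements are different from the average effect among all the evaluators in the study. Recall that $\beta_j$, $j = 1, \ldots, M$, represents the effect of the j-th evaluator on the measurements after controlling for study participants' characteristics. 'Outlier' evaluators can be detected by testing whether evaluator effects on the measurements are statistically different from the mean effect averaged across all evaluators. Therefore, for a given evaluator j, the hypothesis can be formulated as:
$$H_{0,j}: \beta_j = \frac{1}{M}\sum_{q=1}^{M}\beta_q \quad \text{versus} \quad H_{1,j}: \beta_j \neq \frac{1}{M}\sum_{q=1}^{M}\beta_q,$$
which can be written as $H_{0,j}: L_j^T\beta = 0$ versus $H_{1,j}: L_j^T\beta \neq 0$, where $\beta = (\beta_1, \ldots, \beta_M)^T$ and $L_j$ is the M-dimensional contrast vector whose j-th element is $1 - 1/M$ and whose remaining elements are $-1/M$. The quantity $L_j^T\beta = \beta_j - \frac{1}{M}\sum_{q=1}^{M}\beta_q$ can be interpreted as the difference between the mean measurement of the j-th evaluator and the average of the mean measurements over all evaluators, adjusting for the characteristics of the study participants being evaluated. The test statistic of the Wald $\chi^2$ test under the null hypothesis $H_{0,j}$ is [12]:
$$W_j = \frac{(L_j^T\hat{\beta})^2}{L_j^T\,\widehat{\mathrm{Var}}(\hat{\beta})\,L_j} \sim \chi^2_1,$$
where $\widehat{\mathrm{Var}}(\hat{\beta})$ is the estimated variance-covariance matrix of $\hat{\beta}$.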
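The Wald test against the untruncated mean can be sketched as follows. The helper names `contrast_vs_mean` and `wald_test` and the numerical inputs are hypothetical; the p-value uses the identity $P(\chi^2_1 > w) = \mathrm{erfc}(\sqrt{w/2})$, so only the standard library and NumPy are needed.

```python
import math
import numpy as np

def contrast_vs_mean(M, j):
    """Contrast vector L_j: (1 - 1/M) at position j and -1/M elsewhere."""
    L = np.full(M, -1.0 / M)
    L[j] += 1.0
    return L

def wald_test(beta_hat, cov_beta, L):
    """Return (L^T beta_hat, Wald statistic, p-value) for H0: L^T beta = 0."""
    est = float(L @ beta_hat)
    var = float(L @ cov_beta @ L)
    W = est ** 2 / var
    # Upper tail of a central chi-square with 1 df: P(chi2_1 > W)
    p_value = math.erfc(math.sqrt(W / 2.0))
    return est, W, p_value

# Hypothetical first-stage output for M = 5 evaluators.
beta_hat = np.array([67.0, 67.0, 67.0, 67.0, 75.0])
cov_beta = 0.25 * np.eye(5)

L4 = contrast_vs_mean(5, 4)
est, W, p_value = wald_test(beta_hat, cov_beta, L4)
```

Here `est` is the estimated deviation of evaluator 4 from the average effect, and a small `p_value` flags that evaluator as a candidate 'outlier'.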
A more robust approach is to compute a truncated mean of the coefficients, which prevents potential 'outliers' from contaminating the average effect. Let $\beta_{(1)}, \beta_{(2)}, \ldots, \beta_{(M)}$ be the ordered values of the regression coefficients. A $\delta \times 100\%$ truncated mean can be calculated as follows [13]:
$$\bar{\beta}_{\delta} = \frac{1}{M - 2[\delta M]} \sum_{q=[\delta M]+1}^{M-[\delta M]} \beta_{(q)},$$
where $[x]$ denotes the integer part of $x$.
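A small sketch of the symmetric truncated mean follows, applied to the coefficient values later used in the simulation design. The function name `truncated_mean` and the small tolerance added before taking the integer part are choices made for this example.

```python
import math

def truncated_mean(betas, delta):
    """delta x 100% truncated mean: drop the [delta*M] smallest and
    the [delta*M] largest coefficients, then average the rest."""
    M = len(betas)
    # Integer part [delta * M]; the tolerance guards against products
    # such as 0.29 * 100 evaluating to 28.999999999999996.
    k = math.floor(delta * M + 1e-9)
    kept = sorted(betas)[k:M - k]
    if not kept:
        raise ValueError("delta is too large: no coefficients remain")
    return sum(kept) / len(kept)

# Coefficients from the simulation design: five at 75, three at 70, 92 at 67.
betas = [75.0] * 5 + [70.0] * 3 + [67.0] * 92
plain_mean = sum(betas) / len(betas)
trimmed = truncated_mean(betas, 0.10)
```

With 10% truncation, all eight inflated coefficients fall in the discarded upper tail, so the truncated mean recovers the 'normal' effect while the plain mean is pulled upward.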
The null hypothesis that the j-th evaluator is not an 'outlier' now compares the regression coefficient of the j-th evaluator to the $\delta \times 100\%$ truncated mean:
$$H_{0,j}: \beta_j = \bar{\beta}_{\delta}, \quad \text{that is,} \quad H_{0,j}: L_{\delta\times100\%,j}^T\,\beta = 0. \qquad (8)$$
We refer the readers to Supplementary Material Section 1 for technical details on constructing the design matrix $L_{\delta\times100\%,j}^T$ to perform the hypothesis test in (8). Since our goal is to detect as many potential 'outlier' evaluators as possible, we would like to achieve sufficient power when the evaluators are true 'outliers'. Therefore, to complete the hypothesis testing procedure, in contrast to the traditional approach where emphasis is placed upon controlling the type-I error α at an acceptable level, we also attach importance to ensuring an appropriate level of type-II error.

Type-I error determination
Ideally, when performing hypothesis tests to detect potential 'outlier' evaluators, there is sufficient power to reject the null hypotheses H 0,j when a pre-specified alternative hypothesis H 1,j is true. Denote the pre-specified alternative hypothesis as H 1,j : L T j β = c , where c can be determined based on subject matter knowledge. For instance, in the CHEARS AAA, the 'hearing threshold' for each individual ear is measured by the lowest sound intensity of a pure-tone signal presented individually to each ear, to which the listener reliably responds, and the pure-tone signal was measured in 5-dB steps [9]. As a result, hearing loss was defined as a greater than 5-dB HL increase in the pure-tone averages of testing frequencies at low-frequency (0.5, 1, 2 kHz), mid-frequency (3, 4 kHz), and high-frequency (6, 8 kHz) [9]. Therefore, it is important to identify audiologists who consistently gave 5-dB larger or smaller hearing test results than their counterparts after controlling for study participants' characteristics. Thus, a reasonable value for the alternative hypothesis for which we hope to have sufficient power to detect is c = 5 for the CHEARS AAA. For presentational simplicity, we do not distinguish between L j and L δ×100%,j in this section, and we use L j to denote the contrast matrix of both tests.
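In general, the power of the hypothesis test is
$$\phi = P\left(W_j > \chi^2_{1,1-\alpha} \mid H_{1,j}\right),$$
where $W_j$ is the Wald test statistic, $\chi^2_{1,1-\alpha}$ is the $(1-\alpha)$ quantile of the central $\chi^2$ distribution with one degree of freedom, α is a two-sided type-I error rate, and φ is the power of the test. Under the alternative hypothesis, the Wald test statistic follows a noncentral $\chi^2$ distribution with one degree of freedom and noncentrality parameter $\lambda_j = c^2 / \big(L_j^T\,\widehat{\mathrm{Var}}(\hat{\beta})\,L_j\big)$. It follows that the power of the test under the significance level α and alternative hypothesis $H_{1,j}$ is
$$\phi = 1 - F_{\chi^2_1(\lambda_j)}\left(\chi^2_{1,1-\alpha}\right), \qquad (10)$$
where $F_{\chi^2_1(\lambda_j)}$ denotes the cumulative distribution function of the noncentral $\chi^2$ distribution with one degree of freedom and noncentrality parameter $\lambda_j$. To ensure sufficient power for each evaluator at a prespecified alternative hypothesis, we can first fix the power φ of the tests, and solve Eq. (10) to obtain the corresponding significance levels $\alpha_j(\phi)$ for rejecting the null hypothesis $H_{0,j}: L_j^T\beta = 0$. Under the same power and alternative hypothesis, each evaluator has an evaluator-specific significance level instead of a unified one, due to the differences in the estimated variances of the coefficient estimates.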
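To make the link between power and the evaluator-specific significance level concrete, here is a hedged sketch. For one degree of freedom, the noncentral $\chi^2$ power can be written with the standard normal distribution as $\Phi(d - z_{1-\alpha/2}) + \Phi(-d - z_{1-\alpha/2})$, where $d = |c|/\mathrm{se}$ and se is the standard error of $L_j^T\hat{\beta}$. The function names and the bisection solver are assumptions of this example, not the authors' implementation.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def power(alpha, c, se):
    """Power of the two-sided 1-df Wald test at level alpha when L^T beta = c."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    d = abs(c) / se
    return _STD_NORMAL.cdf(d - z) + _STD_NORMAL.cdf(-d - z)

def alpha_for_power(phi, c, se, tol=1e-10):
    """Solve power(alpha) = phi for alpha by bisection.

    Power increases monotonically in alpha, from 0 to 1, so a unique
    solution exists for any target power phi in (0, 1).
    """
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if power(mid, c, se) < phi:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Two hypothetical evaluators with different standard errors, c = 5 dB.
alpha_precise = alpha_for_power(0.8, c=5.0, se=1.0)
alpha_noisy = alpha_for_power(0.8, c=5.0, se=2.0)
```

An evaluator whose contrast is estimated more precisely receives a stricter significance level at the same target power, which is why the levels differ across evaluators.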

False discovery rate estimation
The null hypotheses that we are testing are H 0,1 , H 0,2 , . . . , H 0,M . Due to multiple testing, using a traditional significance level such as 0.05 in each test may lead to a high rate of finding 'outlier' evaluators even if they are 'normal' ones (i.e. making false discoveries) [15,16]. In our setting, since the evaluator-specific significance levels are determined by ensuring a pre-specified power of the tests, we are more likely to make false discoveries than the traditional α-level hypothesis tests when the pre-specified power is large. To protect us from falsely classifying too many 'normal' evaluators as 'outliers' , we propose to adopt the concept of the false discovery rate (FDR) [15] to control the rate of making false positive decisions.
We provide an approximation of the FDR, denoted E(Q; φ), in Eq. (11), where Q is defined as the proportion of true null hypotheses that are falsely rejected among all rejected null hypotheses; we refer the readers to Supplementary Material Section 2 for technical details. Note that, in our approach, instead of using a unified significance level for all tests, such as α = 0.05, each null hypothesis has its own evaluator-specific significance level such that a pre-specified power for detecting a pre-specified alternative hypothesis is achieved for all the hypothesis tests. The estimated FDR, E(Q; φ), on the other hand, can inform us of the number of false discoveries that may be made. Therefore, when choosing an appropriate set of significance levels, apart from ensuring sufficient power for the tests, the estimated FDR can be used as another criterion reflecting our tolerance towards making false discoveries.

FDR vs. Power decision plot
As described in the previous sections, for a given power, we can solve Eq. (10) to get the corresponding evaluator-specific significance levels for rejecting the null hypotheses $H_{0,j}$, $j = 1, \ldots, M$, and based on these significance levels, the corresponding FDR can be estimated using Eq. (11). Therefore, the relationship between power and FDR can be displayed in a decision plot where the power (φ) is on the x-axis and the corresponding estimated FDR (E(Q; φ)) is on the y-axis. Based on the decision plot, we can pick the significance levels at which an acceptable trade-off between power and the FDR is achieved.
We could also first select a relatively low FDR and find the corresponding power along with the evaluator-specific significance levels from the decision plot; we can then reject the null hypotheses with p-values of the tests less than the thresholds. Alternatively, if we are less concerned about making false discoveries but would like to be able to detect as many potential 'outlier' evaluators as possible, then we could first specify a relatively large power, and reject the null hypotheses by comparing the p-values with the corresponding evaluator-specific significance levels; the estimated FDR from the decision plot can inform us of the number of false discoveries we might have made.
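The points of an FDR vs. power decision plot can be computed along the following lines. This is a sketch under a strong assumption: the FDR estimate used here is a simple plug-in (the sum of the significance levels, which is the expected number of false rejections if every null hypothesis were true, divided by the number of rejections), and it stands in for the paper's Eq. (11), whose exact form is given in the supplementary material. All names and input values are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def alpha_for_power(phi, c, se, tol=1e-10):
    """Evaluator-specific level giving power phi at alternative c (bisection)."""
    d = abs(c) / se
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        z = _STD_NORMAL.inv_cdf(1.0 - mid / 2.0)
        if _STD_NORMAL.cdf(d - z) + _STD_NORMAL.cdf(-d - z) < phi:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def decision_curve(p_values, ses, c, powers):
    """Return (power, plug-in FDR estimate, rejected indices) per power."""
    rows = []
    for phi in powers:
        alphas = [alpha_for_power(phi, c, se) for se in ses]
        rejected = [j for j, (p, a) in enumerate(zip(p_values, alphas)) if p < a]
        # Assumed plug-in estimate, NOT the paper's Eq. (11)
        fdr = min(1.0, sum(alphas) / max(len(rejected), 1))
        rows.append((phi, fdr, rejected))
    return rows

# Hypothetical Wald p-values and standard errors for four evaluators.
p_values = [1e-8, 0.2, 0.9, 0.04]
ses = [2.0, 2.0, 2.4, 1.8]
rows = decision_curve(p_values, ses, c=5.0, powers=[0.5, 0.8, 0.95])
```

Plotting the second element of each row against the first gives the decision curve; raising the target power loosens every evaluator-specific level, so the set of flagged evaluators can only grow.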

FDR-based adjustment
We

Simulation
We perform a simulation study to assess the proposed quality control procedure for detecting 'outlier' evaluators. As a demonstration, we base our simulations on the audiometrically-assessed hearing threshold measurements at 8 kHz that were obtained in the CHEARS AAA in 2014, where 3,568 participants had assessments in both ears that were measured by 68 different licensed audiologists. Note that the AAA was still in the data collection stage in 2014, and detecting the 'outlier' audiologists would help investigators make prompt adjustments to obtain accurate measurements for tests conducted afterwards. We evaluate the performance of the proposed FDR estimator in Eq. (11), as well as the true positives (successfully detecting true 'outlier' evaluators) and false positives (falsely classifying 'normal' evaluators as 'outliers') yielded by our quality control method, compared with using a traditional, unified significance level such as α = 0.05 to reject the null hypotheses.

Data generation
We first consider the scenario in which evaluators measure a single outcome for each study participant. We generate data based on the model below, mimicking the right ear data obtained from the CHEARS AAA:
$$Y_i = \gamma_1\,\text{age}_i + \gamma_2\,\text{age}_i^2 + \gamma_3\,I(\text{very good}_i) + \gamma_4\,I(\text{a little hearing trouble}_i) + \sum_{j=1}^{M} \beta_j\,\text{Audio}_i^{(j)} + \epsilon_i, \qquad (12)$$
where $\text{Audio}_i^{(j)}$ is the indicator that the i-th participant was tested by the j-th audiologist, and age is generated from a normal distribution with mean 56.6 years and standard deviation (SD) 4.4; we set the 'excellent' self-reported hearing status as the reference group, and the prevalences of the other two categories, 'very good' and 'a little hearing trouble', were 0.44 and 0.25, respectively. These values are the same as those in the CHEARS AAA. The coefficients corresponding to age, age², I(very good), and I(a little hearing trouble) are set to be $\gamma_1 = -2.7$, $\gamma_2 = 0.03$, $\gamma_3 = 3.3$ and $\gamma_4 = 10.3$, the same as the point estimates from the regression analysis on the CHEARS data. The number of audiologists M is set to be 100, and each measures the hearing outcomes of 40 study participants. We set the coefficients as $\beta_1 = \beta_2 = \ldots = \beta_5 = 75$, $\beta_6 = \beta_7 = \beta_8 = 70$ and $\beta_9 = \beta_{10} = \ldots = \beta_{100} = 67$. Since the average audiologist effect is approximately 67, the 92 audiologists with true effect 67 are considered 'normal' audiologists, and the 3 audiologists with effect 70 and the 5 with effect 75 are considered true outliers. Note that, here, five 'outlier' audiologists have very different effects on the hearing test outcomes from 'normal' audiologists, and three 'outlier' audiologists are slightly different from 'normal' audiologists. The values 75 and 67 are determined by the averages of the estimated regression coefficients in the regression analysis on the CHEARS data for the audiologists in the upper 10th percentile and those between the lower and upper 10th percentiles, respectively. The residual $\epsilon_i$ is assumed to be normally distributed with mean 0 and standard deviation (SD) σ = 8, 10, or 12.

Simulation results
The simulation is performed for 300 replicates. Shown in Fig. 1 are the FDR vs. Power decision plots under different standard deviations (SD) of the residuals. We set the alternative hypothesis as $H_{1,j}: L_{10\%,j}^T\beta = 5$. The solid curve is the estimated FDR based on Eq. (11), averaged over the 300 simulation replicates, under powers (φ) ranging from 0.1 to 0.95 with a step size of 0.01; a loess curve with the default smoothing span 0.75 is fitted to connect the points. The dashed curve is an empirical version of the true FDR, which, for each φ, is the ratio of the number of 'normal' audiologists (Audiologists 9-100) falsely detected as 'outlier' audiologists to the total number of detected 'outlier' audiologists, averaged over the 300 simulation replicates. The horizontal dot-dash line is the empirical version of the true FDR, averaged over the 300 simulation replicates, if we use α = 0.05 as the significance level for rejecting the null hypotheses. As shown in the decision plot, the estimated FDR is very close to the true FDR when σ = 8 and 10, while it slightly overestimates the true value when σ = 12. Moreover, as the SD of the residual increases, the FDR also increases. For example, when σ = 8, the FDR is less than 0.165 under power 0.95, while if σ increases to 12, the FDR is greater than 0.8 under the same power. Define the noise ratio as $\sigma^2 / \mathrm{Var}(Y)$, which is the proportion of the total variance of the outcome measurement that is due to the residual. The corresponding noise ratios are approximately 0.52, 0.64, and 0.72 for σ = 8, 10 and 12. When the noise ratio increases, we are more likely to make false discoveries.
Therefore, when performing quality control, including all the possible predictors and confounders in the first stage regression is crucial; this way, we can minimize the residual variance of the first stage regression and, as a result, minimize the FDR.
Compared with an approach that uses a fixed significance level α = 0.05, our method enjoys more flexibility since we can choose the evaluator-specific significance levels by considering both the power and the FDR. When σ = 8, under any power, our approach has a much lower FDR than using α = 0.05 as the threshold; and when σ = 10 and 12, even though the FDR increases, it is still smaller than the FDR obtained using α = 0.05 as the threshold when the power is chosen to be less than 0.8 and 0.75, respectively. Since the goal of the method is to detect as many potential 'outlier' evaluators as possible while keeping the type-I error rate at an acceptable level, we define the true positive proportion for each true 'outlier' audiologist (i.e., Audiologists 1 to 8) as the proportion of the 300 simulation replicates that correctly detect the audiologist as an 'outlier', and the false positive proportion for each true 'normal' audiologist (i.e., Audiologists 9 to 100) as the proportion of the 300 simulation replicates that falsely identify the audiologist as an 'outlier' (see Figure 2a). For the unadjusted procedure, as the power increases, the true positive proportions for Audiologists 1 to 5 reach 1 quickly, which is expected since the difference between their coefficients and those of the 'normal' audiologists is set to be 8, greater than the difference used in the alternative hypothesis $H_{1,j}: L_{10\%,j}^T\beta = 5$. However, for Audiologists 6 to 8, since their coefficients are only 3 larger than those of the 'normal' audiologists, the true positive proportions are far less than 1 even when the power is large. Compared to the approach that uses α = 0.05 as the threshold, our quality control procedure has smaller true positive proportions when the power of the test is smaller than 0.3, 0.6 and 0.7 for σ = 8, 10 and 12, respectively, but they gradually increase to approximately the same or an even higher level.
For the 'normal' audiologists (Audiologists 9 to 16), the false positive proportions are approximately 0.05 if using α = 0.05 as the threshold. Our quality control procedure has even smaller false positive proportions when σ = 8 and 10 under nearly every power considered. When σ = 12 , the false positive proportions are still smaller than those from using α = 0.05 as the threshold, if the power is no larger than 0.9.
Compared with the unadjusted procedure, the FDR-based adjusted true positive proportions for the true 'outlier' audiologists and false positive proportions for 'normal' audiologists do not change much in the case of σ = 8, since the FDR is small and the adjustment is minor. As σ increases, for example, when σ = 10, the FDR is large enough to yield a sufficient number of adjustments for power larger than 0.75. Apart from a decrease in the false positive proportions for the true 'normal' audiologists (Audiologists 9 to 16), we also observe a decrease in the true positive proportions for the true 'outlier' audiologists (Audiologists 1 to 8). Therefore, the ad hoc FDR-based adjustment helps to reduce the chances of making false discoveries, at the price of a reduction in the probability of making true positive decisions.
Moreover, we also conducted a simulation study for the scenarios in which outcomes are correlated. The data generation process and simulation results are presented in Supplementary Material Section 1. The simulation results are similar to those of the single measurement scenarios; our outlier detection procedure typically has lower false positive proportions for the true 'normal' audiologists and higher true positive proportions for the true 'outlier' audiologists compared with the approach that fixes the significance level at α = 0.05.

Application
To illustrate our method, we apply it to detect 'outlier' audiologists for the audiometrically-assessed hearing threshold measurements in the CHEARS AAA collected in 2014, when the baseline testing was completed on 3,749 participants. We focus on the test results at 8 kHz. We use the GEE approach in the first stage regression analysis and include age, age², self-reported hearing status ('excellent', 'very good' and 'a little hearing trouble'), and dummy variables for the 68 audiologists in the regression model. This regression is fitted using SAS proc genmod, assuming an exchangeable working variance-covariance structure.
Scatter plots of the audiologists' coefficient estimates are displayed in Fig. 3. Regardless of whether we are comparing with the untruncated mean or the 10% truncated mean, the plots are similar. As shown in Fig. 3a and b, Audiologist 13 has a much larger (> 10 dB) coefficient estimate than their counterparts, and Audiologist 4 has a much smaller (< −10 dB) coefficient estimate than the rest of the audiologists. Moreover, Audiologists 14, 15, 22, 47, 48, 54, 55 and 59 have mildly different (5-10 dB) coefficient estimates from the average effect. Figure 4a to d show the FDR vs. Power decision plots, where the hypothesis tests are performed to compare each audiologist's regression coefficient with both the untruncated mean and the 10% truncated mean. We fix the alternative hypothesis as $H_{1,j}: L_j^T\beta = 5$ and 10, and $H_{1,j}: L_{10\%,j}^T\beta = 5$ and 10, respectively, for j = 1, . . . , 68. Based on the decision plots, 'outlier' audiologists can be detected by choosing an appropriate set of significance levels that correspond to reasonable power and FDR. The results are similar between the untruncated mean and

Discussion
In this paper, we propose a novel method to optimize data quality during the data collection stage, addressing a common issue in large epidemiologic studies that rely on multiple evaluators to obtain exposure or outcome measurements. Specifically, we developed a two-stage algorithm to detect 'outlier' evaluators, who may tend to have higher or lower measurements than those of their counterparts. In the first stage, we fit a regression model for the measurements against evaluators and study participants' characteristics that could predict the measurements. In the second stage, based on the regression coefficients from the first stage, we perform hypothesis tests to compare the mean measurement of each evaluator with the average of the mean measurements over all evaluators, adjusting for the characteristics of the individuals evaluated. In contrast to the traditional hypothesis testing procedure, where controlling the type-I error is the primary focus, we attach equal importance to ensuring an appropriate level of type-II error, since our goal is to detect as many potential 'outlier' evaluators as possible for quality control purposes. We derive the evaluator-specific significance levels for rejecting the null hypotheses under selected powers of the tests. These significance levels are not necessarily 0.05 and differ across evaluators due to the differences in the variances of the coefficient estimates. To account for the issue of multiple comparisons, we also derive an FDR estimator. An FDR vs. Power decision plot can be created, and based on this plot, the evaluator-specific significance levels for rejecting the null hypotheses can be determined such that both the FDR and the power are acceptable. When performing hypothesis tests to detect 'outlier' evaluators, we proposed to compare the coefficient estimates to the truncated mean to prevent 'outlier' evaluators from contaminating the estimated normal effect.
Alternatively, we can consider an interval null, that is, $H_0: \left|\beta_i - \frac{1}{M}\sum_{j=1}^{M}\beta_j\right| \le a$ for some constant $a > 0$. A challenge of this method might be how to select $a$. We will consider this method in our future research and compare it with the current method. Moreover, when calculating the evaluator-specific significance level, knowledge of the alternative hypothesis is needed. If such prior knowledge is not available, we recommend performing a sensitivity analysis over a series of reasonable values of the alternative hypothesis. In addition, the FDR approximation in Eq. (2) holds when the number of hypotheses (M) being tested is large. When M is small, we can instead use the Benjamini-Hochberg (BH) procedure to control the FDR [15]. The BH procedure proceeds by first specifying an FDR level α and sorting the null hypotheses by their p-values in ascending order ($P_{(1)}, P_{(2)}, \ldots, P_{(M)}$). Then the largest k such that $P_{(k)} \le \frac{k}{M}\alpha$ is obtained, and the first k null hypotheses are rejected. The BH procedure ensures that the FDR is controlled at level α. However, unlike our approach, the BH procedure does not consider the power of the tests, and to be conservative, we might use a relatively larger α level such as 0.1 when conducting the BH procedure. There are several important points for consideration based on our work. First, an increase in the noise ratio $\sigma^2 / \mathrm{Var}(Y)$ will increase the FDR, especially when the power of the test is large. Therefore, in the first stage regression, it is crucial to include all potential predictors of the measurements as regressors. Second, the proposed method assumes that the evaluator effect on the measurements is not modified by the participants' characteristics.
In the case when this assumption is violated, we can estimate the evaluator effect in each category of the potential effect modifier by including the evaluator indicator-effect modifier interactions in the first stage regression model, and then we can regard the same evaluator for testing study participants in different categories of the effect modifier as if they were different evaluators. This way, an evaluator could be detected as an 'outlier' only when testing study participants in a specific category of the effect modifier. Third, to accommodate situations where the measurements are not continuous, a link function can be used in the first stage regression, such as the logit link for binary measurements, and log link for count measurements.
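The Benjamini-Hochberg procedure mentioned above as an alternative when M is small can be sketched in a few lines. The function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the indices of the null hypotheses rejected by the BH procedure.

    Finds the largest rank k with P_(k) <= (k / M) * alpha and rejects the
    hypotheses with the k smallest p-values.
    """
    M = len(p_values)
    order = sorted(range(M), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / M * alpha:
            k = rank  # keep the largest rank that satisfies the bound
    return sorted(order[:k])

# Hypothetical p-values for four evaluators.
p_values = [0.01, 0.04, 0.03, 0.5]
rejected_strict = benjamini_hochberg(p_values, alpha=0.05)
rejected_loose = benjamini_hochberg(p_values, alpha=0.10)
```

Because BH is a step-up procedure, a hypothesis can be rejected even if its own p-value exceeds its rank-specific bound, as long as a larger-ranked p-value meets its bound.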
Our quality control procedure is used to detect potential 'outlier' evaluators; once they are detected, a quality check on those evaluators should be performed to ensure that future measurements are obtained accurately. However, correcting measurement errors in existing measurements obtained by 'outlier' evaluators is beyond the scope of this paper. We will develop measurement error correction methods in future research; one idea could be to calibrate the measurements from 'outlier' evaluators to 'normal' measurements using information from the first-stage regression models, taking into account participants' characteristics.
The regular regression and GEE approaches may not lead to a reliable estimator of β if the numbers of study participants tested by some evaluators are small. In this case, an alternative method is to treat the measurements from the same evaluator as a cluster and to use a mixed effects model in the first stage regression analysis. In the scenario where each participant has a single measurement, this mixed effects model may include an evaluator-specific random intercept in addition to the fixed effects of participants' characteristics; the estimated value of the j-th evaluator-specific intercept is $\hat{\beta}_j$. Similarly, in the scenario where the participants have multiple measurements, the mixed effects model may include both evaluators and participants (nested within evaluator) as random effects. Once $\hat{\beta}$ and $\widehat{\mathrm{Var}}(\hat{\beta})$ are obtained from the mixed effects model, the rest of the methods are the same as those stated in Subsection 'Hypothesis testing' to Subsection 'FDR-based adjustment' of this paper.
In addition to its contribution to quality control during the data collection stage of epidemiologic studies, our outlier detection method can also be valuable in clinical settings for the detection of 'outlier' evaluators (e.g. health providers or technicians); for example, clinical diagnoses often rely on measurements from evaluators, and inaccurate measurements may lead to wrong diagnoses. Furthermore, our method can be used in statistical analysis procedures. For example, for studies based on laboratory measurements of biomarkers such as plasma or urine metabolites that are measured in different batches, our method can help to identify potential 'outlier' batches, and a sensitivity analysis can be conducted by excluding those 'outlier' batches and re-estimating the parameters of interest.

Conclusions
Our two-stage algorithm is a useful method for detecting 'outlier' evaluators who tend to give higher or lower measurements than their counterparts after adjusting for study participants' characteristics. In contrast to traditional hypothesis tests that focus on the type-I error, we also attach importance to the type-II error so that as many potential 'outliers' as possible can be identified, and an estimated FDR is used to limit the rate of false positives. We recommend applying our method for 'outlier' detection during the data collection stage to improve data quality.